ABSTRACT

In  this  paper  we  present  some  of  the  algorithm  improvements 
that  have  been  made  to  Dragon’s  continuous  speech  recog¬ 
nition  and  training  programs,  improvements  that  have  more 
than  halved  our  error  rate  on  the  Resource  Management  task 
since  the  last  SLS  meeting  in  February  1991.  We  also  report 
the  “dry  run”  results  that  we  have  obtained  on  the  5000-word 
speaker-dependent  Wall  Street  Journal  recognition  task,  and 
outline  our  overall  research  strategy  and  plans  for  the  future. 

In  our  system,  a  set  of  output  distributions,  known  as  the 
set  of  PELs  (phonetic  elements),  is  associated  with  each 
phoneme.  The  HMM  for  a  PIC  (phoneme-in-context)  is  rep¬ 
resented  as  a  linear  sequence  of  states,  each  having  an  out¬ 
put  distribution  chosen  from  the  set  of  PELs  for  the  given 
phoneme,  and  a  (double  exponential)  duration  distribution. 

In  this  paper  we  report  on  two  methods  of  acoustic  modeling 
and  training.  The  first  method  involves  generating  a  set  of 
(unimodal)  PELs  for  a  given  speaker  by  clustering  the  hypo¬ 
thetical  frames  found  in  the  spectral  models  for  that  speaker, 
and  then  constructing  speaker-dependent  PEL  sequences  to 
represent  each  PIC.  The  “spectral  model”  for  a  PIC  is  sim¬ 
ply  the  expected  value  of  the  sequence  of  frames  that  would 
be  generated  by  the  PIC.  The  second  method  represents  the 
probability  distribution  for  each  parameter  in  a  PEL  as  a 
mixture  of  a  fixed  set  of  unimodal  components,  the  mixing 
weights  being  estimated  using  the  EM  algorithm.  In  both 
models  we  assume  that  the  parameters  are  statistically  inde¬ 
pendent. 

We  report  results  obtained  using  each  of  these  two  meth¬ 
ods  (RePELing/Respelling  and  univariate  “tied  mixtures”) 
on  the  5000-word  closed-vocabulary  verbalized  punctuation 
version  of  the  Wall  Street  Journal  task. 

1.  INTRODUCTION 

This  paper  presents  “dry  run”  results  of  work  done 
at  Dragon  Systems  on  the  Wall  Street  Journal  (WSJ) 
benchmark  task.  After  we  give  a  brief  description  of  our 
continuous  speech  recognition  system,  we  describe  the 
two  different  kinds  of  acoustic  models  that  were  used 
and  explain  how  they  were  trained.  Then  we  present 
and discuss the results obtained so far and review our
plans  for  further  research. 

In  our  system  a  set  of  output  distributions,  known  as  the 
set  of  PELs  (phonetic  elements),  is  associated  with  each 
phoneme.  The  HMM  for  a  PIC  (phoneme-in-context) 
is  represented  as  a  linear  sequence  of  states,  each  hav¬ 
ing  an  output  distribution  chosen  from  the  set  of  PELs 
for the given phoneme, and a (double exponential) duration
distribution. The model for a particular hypothesis
is  constructed  by  concatenating  the  necessary  sequence 
of  PICs,  based  on  the  specified  pronunciation  (sequence 
of  phonemes)  for  each  of  the  component  words.  Thus 
our  system  models  both  word-internal  and  cross-word 
co-articulation. When a model for a PIC that is needed
does  not  exist,  a  “backoff”  strategy  is  used,  whereby  the 
model  for  a  different,  but  related,  PIC  is  used  instead. 

The  two  methods  to  be  compared  in  this  paper  consti¬ 
tute  different  strategies  for  representing  and  training  the 
output  distributions  to  be  used  for  the  nodes  found  in  the 
PIC  models.  The  first  method  involves  generating  a  set 
of  (unimodal)  PELs  for  a  given  speaker  by  clustering  the 
hypothetical  frames  found  in  the  spectral  models  for  that 
speaker, a step we call “rePELing”, and then constructing
speaker-dependent PEL sequences to represent each
PIC  as  an  HMM,  which  we  call  “respelling”.  The  spec¬ 
tral  model  for  a  PIC  can  be  thought  of  as  the  expected 
value  of  the  sequence  of  frames  that  would  be  generated 
by  the  PIC,  normalized  to  an  average  length.  The  sec¬ 
ond  method,  a  univariate  version  of  tied  mixtures,  rep¬ 
resents  the  probability  distribution  for  each  parameter 
in  a  PEL  as  a  mixture  of  a  fixed  set  of  unimodal  compo¬ 
nents,  the  mixing  weights  being  estimated  using  the  EM 
algorithm  [9].  In  both  the  RePELing/Respelling  and  the 
tied  mixture  models,  we  assume  that  the  parameters  are 
statistically  independent.  A  more  detailed  explanation 
of  these  two  methods  can  be  found  in  sections  3  and  4. 

2.  OVERVIEW  OF  DRAGON 
TRAINING  AND  RECOGNITION 

The  continuous  speech  recognition  system  developed  by 
Dragon  Systems  was  presented  at  the  June  1990  DARPA 
SLS meeting ([5], [6], [11]) and at the February 1991
DARPA  SLS  meeting  ([4]).  The  version  presented  in 
this  paper  is  speaker-dependent,  and  was  demonstrated 
to  be  capable  of  near  real-time  performance  on  a  1000- 
word  task  when  running  on  a  486-based  PC.  When  run¬ 
ning  live,  a  TMS320C25-based  board  performs  the  sig¬ 
nal  processing  and  the  speech  is  sampled  at  12kHz.  In 
the  experiments  reported  in  this  paper,  the  speech  was 
sampled  at  16kHz,  the  speech  waveforms  having  been 
supplied  in  a  standard  format  by  NIST. 

An  important  contribution  to  our  improved  performance 
in  the  last  year  was  our  switch  to  32  signal  processing 
parameters  (consisting  of  our  eight  original  spectral  pa¬ 
rameters  together  with  12  cepstral  parameters  and  their 
estimated  time  derivatives).  The  cepstral  parameters 
were  computed  via  an  inverse  Fourier  transform  of  the 
log  magnitude  spectrum.  At  recognition  time,  the  pa¬ 
rameters  are  computed  every  20  ms,  while  for  purposes 
of  training,  10  ms  data  was  used. 
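
As a rough illustration of the cepstral computation described above, the following Python sketch takes the inverse Fourier transform of the log magnitude spectrum of a single frame. The function names, the naive O(n^2) transform, the small floor `eps`, and the choice of keeping coefficients 1 through 12 are assumptions made for illustration only; the text does not specify the windowing, the frame length, or which coefficients are retained.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive discrete Fourier transform (O(n^2)); adequate for short frames."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out

def real_cepstrum(frame, num_coeffs=12, eps=1e-10):
    """Inverse DFT of the log magnitude spectrum of one frame of samples."""
    spectrum = dft(frame)
    log_mag = [math.log(abs(c) + eps) for c in spectrum]
    cepstrum = [c.real for c in dft(log_mag, inverse=True)]
    # Hypothetical choice: skip c[0] (overall level) and keep the next few.
    return cepstrum[1:num_coeffs + 1]

# Usage: a toy 64-sample frame made of two sinusoids.
frame = [math.sin(2 * math.pi * 5 * t / 64) + 0.5 * math.sin(2 * math.pi * 12 * t / 64)
         for t in range(64)]
coeffs = real_cepstrum(frame)
```

Time derivatives of the cepstral parameters could then be estimated from differences between neighboring frames, although the estimator actually used is not described here.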

The  recognition  algorithm  relies  on  frame-synchronous 
dynamic programming (an implementation of the forward
pass of the Baum-Welch algorithm) to extend sentence
hypotheses subject to the elimination of poor paths
by  beam  pruning.  In  addition,  the  Continuous  Speech 
Recognizer  uses  the  DARPA-mandated  digram  language 
model  ([15]),  which  is  a  modification  of  the  backoff  al¬ 
gorithm  from  [13].  The  rapid  matcher,  as  described  in 
[11],  is  another  important  component  of  the  system.  For 
any  frame,  it  limits  the  number  of  word  candidates  that 
can  be  hypothesized  as  starting  at  that  frame.  For  pur¬ 
poses  of  this  paper,  which  is  primarily  concerned  with 
the quality of our modeling, most of the rapid match errors
have been eliminated by passing through long lists
of words for the detailed match to consider, at the cost
of considerable additional computation. Similarly, most
of  the  pruning  errors  have  been  eliminated  by  running 
with a high threshold. A companion paper [10], which appears
in this volume, describes a new strategy for training
the rapid match models directly from the Hidden
Markov Models specified by the PICs. This new strategy
shows promise for reducing the average length of the
rapid match list that must be returned at any given time,
and thus for speeding up the recognizer.
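
The beam pruning mentioned above can be sketched as follows. The pruning criterion is not spelled out in the text, so this sketch assumes a common choice: at each frame, discard every hypothesis whose log score falls more than a fixed margin below the best one. The names `prune_beam` and `beam_width` are hypothetical.

```python
def prune_beam(scores, beam_width):
    """Keep hypotheses whose log score is within beam_width of the best."""
    if not scores:
        return {}
    best = max(scores.values())
    return {hyp: s for hyp, s in scores.items() if s >= best - beam_width}

# Usage: log scores of three partial sentence hypotheses at one frame.
active = {"the cat": -12.0, "the cap": -14.5, "a cat": -30.0}
survivors = prune_beam(active, beam_width=10.0)
```

Under this assumption, a larger `beam_width` plays the role of the high threshold described above: fewer correct paths are pruned, at the cost of more computation.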

In  the  experiments  described  below,  models  were  trained 
for each of the 12 speaker-dependent Wall Street Journal
speakers, using the approximately 600 training sentences
(300 with verbalized punctuation and 300 without).
Testing was done using the approximately 40
recorded  sentences  (per  speaker)  available  as  the  5000- 
word  closed-vocabulary  verbalized  punctuation  develop¬ 
ment  test  set. 


In  order  to  incorporate  context  information  at  the 
phoneme  level,  triphone  structures  were  constructed  that 
include  information  about  the  immediate  phonetic  en¬ 
vironment  that  affects  a  phoneme’s  acoustic  character. 
These  augmented  triphones,  called  “PIC”s,  are  the  fun¬ 
damental  unit  of  the  system,  and  are  closely  related  to 
other  approaches  that  have  appeared  in  the  literature 
([16]  and  [14]).  The  information  that  the  PICs  currently 
contain  is  the  identity  of  the  preceding  and  succeeding 
phonemes,  and,  optionally,  an  estimate  of  the  degree  of 
the  phoneme’s  prepausal  lengthening.  Each  PIC  is  rep¬ 
resented  acoustically  by  a  sequence  of  nodes.  Each  node 
is  taken  to  have  an  output  distribution  specified  by  a 
PEL,  and  a  duration  distribution.  PIC  models  repre¬ 
senting  the  same  phoneme  may  share  PELs,  but  PELs 
can  never  be  shared  across  phonemes.  The  parametric 
family  used  for  modeling  the  probability  distributions  of 
the durations as well as of the individual acoustic parameters
is assumed to have the double exponential form

    f(x) = \frac{1}{2\sigma} \exp\left( -\frac{|x - \mu|}{\sigma} \right),

where \mu is the mean and \sigma is the mean absolute deviation.
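
To make the role of this density concrete, here is a minimal Python sketch of scoring one frame against one PEL. It assumes the standard double exponential (Laplace) form, f(x) = exp(-|x - mu| / sigma) / (2 sigma), and the parameter independence stated in the paper, so that the frame log likelihood is a sum over parameters. The function names and the toy 3-parameter PEL are hypothetical.

```python
import math

def log_double_exponential(x, mu, sigma):
    """Log of f(x) = exp(-|x - mu| / sigma) / (2 * sigma)."""
    return -abs(x - mu) / sigma - math.log(2.0 * sigma)

def frame_log_likelihood(frame, means, deviations):
    """Sum of per-parameter log densities, assuming independent parameters."""
    return sum(log_double_exponential(x, m, s)
               for x, m, s in zip(frame, means, deviations))

# Usage with a toy 3-parameter PEL (the system described here uses 32).
pel_means = [0.0, 1.0, -2.0]
pel_devs = [1.0, 0.5, 2.0]
score = frame_log_likelihood([0.0, 1.0, -2.0], pel_means, pel_devs)
```

Working in the log domain turns the product over parameters into a sum of absolute differences scaled by the deviations, which is inexpensive to evaluate.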

A  detailed  description  of  the  original  models  for  PICs 
and how they were formerly trained can be found in [6].
The  following  sections  explain  how  a  variety  of  modi¬ 
fications  have  been  made  to  the  original  PIC  training 
algorithm. 

The English phoneme alphabet used by the system includes
26 consonants (including the syllabic consonants,
/L/, /M/, and /N/) and three levels of stress for each of
17 vowels, constituting a total of 77 phonemes. Approximately
10% of the lexical entries for the 5000-word WSJ
task have multiple pronunciations, because of stress differences
in the vowels and expected pronunciation variations.

Of  course,  the  number  of  possible  PICs  that  can  ap¬ 
pear  in  hypotheses  at  recognition  time  (including  cross¬ 
word  PICs)  is  vast  compared  to  the  number  of  PICs  that 
typically  appear  in  600  sentences  of  Wall  Street  Journal 
training  data.  This  paper  reports  results  when  around 
35,000  PICs  are  built  for  the  rePEL/respell  models  and 
when  around  14,000  PICs  are  built  for  the  tied  mixture 
models.  When  the  recognizer  asks  for  a  model  for  a  PIC 
that  has  not  been  built,  a  backoff  strategy  is  invoked 
which  supplies  a  model  for  a  related  PIC  instead. 
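
A backoff lookup of this kind might be sketched as below. The text does not give the exact fallback order, so the order used here (full context, then one-sided context, then a context-free model) is an assumption, as are the function name `find_model` and the use of `None` to mark an unspecified context.

```python
def find_model(models, left, phoneme, right):
    """Return the most specific available model for a phoneme in context.

    `models` maps (left, phoneme, right) keys to models; None marks an
    unspecified ("generic") context. The fallback order is hypothetical.
    """
    candidates = [
        (left, phoneme, right),   # fully contextual PIC
        (left, phoneme, None),    # left context only
        (None, phoneme, right),   # right context only
        (None, phoneme, None),    # context-free model for the phoneme
    ]
    for key in candidates:
        if key in models:
            return key, models[key]
    raise KeyError(f"no model available for phoneme {phoneme!r}")

# Usage with placeholder model objects.
models = {("k", "ae", "t"): "model-A", (None, "ae", None): "model-B"}
key, model = find_model(models, "b", "ae", "t")
```

Here the requested PIC has not been built, so the lookup falls through to the context-free entry for the same phoneme.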
3.  REPELING/RESPELLING

In  earlier  reports  [6],  [7],  we  described  a  straightforward 
procedure  that  generated  speaker-dependent  models  via 
several  passes  of  adaptation  of  the  reference  speaker’s 
models.  The  adaptation  process  modified  the  PEL  prob¬ 
ability  distributions  and  the  PIC-dependent  duration 
distributions.  However,  no  new  PELs  were  created,  nor 
was  the  PEL  sequence  for  a  given  PIC  allowed  to  change. 
The  sharing  of  PELs  by  different  PICs  was  determined 
by  the  acoustics  of  the  reference  speaker’s  speech,  and 
was  assumed  to  generalize  to  other  speakers. 

At  the  last  SLS  meeting  in  Feb  1991  [4],  we  reported  on 
a  method  for  choosing  the  sequence  of  PELs  for  a  PIC 
in  a  speaker-dependent  fashion,  essentially  in  the  same 
manner  as  had  been  done  for  the  reference  speaker.  This 
step  could  be  performed  once  the  original  PELs  had  been 
adapted  using  the  reference  speaker’s  PIC  spellings.  To 
the  extent  that  differences  in  PEL  sequences  for  a  given 
PIC can reflect different choices of allophones, this extra
step can capture allophonic variation among different
speakers, and lifts the restriction that the sharing of
PELs  be  the  same  for  all  speakers.  This  change  produced 
a  significant  improvement  in  performance. 

In  order  to  take  full  advantage  of  our  new  more  infor¬ 
mative  signal  processing  parameters,  however,  a  further 
change  was  required.  We  needed  to  construct  a  new  set 
of  PELs  to  serve  as  the  class  of  output  distributions  for 
the  HMMs  to  be  constructed.  It  was  not  adequate  to 
simply  extend,  by  adaptation,  the  8  parameter  PELs  we 
had been working with, to 32 parameter PELs, as this
would  prevent  us  from  making  distinctions  that  could 
not  even  be  seen  with  the  old  signal  processing. 

In  the  previous  reports  [6]  and  [4],  we  described  how 
a  set  of  PELs  for  the  reference  speaker  was  initially 
hand-constructed  while  running  an  interactive  program 
for  “labeling”  spectrograms  of  the  reference  speaker’s 
speech.  We  needed  to  be  able  to  construct  a  new  set 
of  PELs  automatically;  thus,  we  implemented  a  k-means 
clustering algorithm whose purpose was to create a new
set of (32 parameter) PELs for each speaker whose models
were to be trained. For each phoneme, this step involved
clustering the frames in the “spectral models” for all of the
PICs to be constructed for that phoneme. A spectral model
for a PIC is obtained by performing linear stretching
and shrinking operations on PIC tokens (examples of the
given PIC and of related PICs, available from a prior segmentation
of the training data, based on the best models
then  available)  and  then  averaging  the  resulting  trans¬ 
formed  tokens  (which  have  a  common  length),  to  obtain 
a  kind  of  “expected”  PIC  token. 
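
The construction of a spectral model can be sketched as follows. This is a minimal illustration that assumes the stretching and shrinking is done by linear interpolation between neighboring frames and that the common length is supplied by the caller; the function names are hypothetical, and the weighting of related PICs is not modeled.

```python
def resample(token, length):
    """Linearly stretch or shrink a token (list of frames) to `length` frames."""
    n = len(token)
    if n == 1 or length == 1:
        # Degenerate cases: repeat (or keep only) the first frame.
        return [list(token[0]) for _ in range(length)]
    out = []
    for i in range(length):
        pos = i * (n - 1) / (length - 1)   # fractional index into the token
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(token[lo], token[hi])])
    return out

def spectral_model(tokens, length):
    """Average the length-normalized tokens frame by frame."""
    warped = [resample(t, length) for t in tokens]
    dim = len(warped[0][0])
    return [[sum(w[i][d] for w in warped) / len(warped) for d in range(dim)]
            for i in range(length)]

# Usage: two one-parameter tokens of different lengths, averaged at length 3.
tokens = [[[0.0], [2.0]], [[1.0], [2.0], [3.0]]]
model = spectral_model(tokens, 3)
```

In this toy case the first token is stretched to three frames (0, 1, 2) and averaged with the second, giving a three-frame "expected" token.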

The primary motivation behind the rePELing step was
to make it likely that each spectral frame would have at
least  one  PEL  that  matched  it  fairly  well.  As  each  of  the 
77  phonemes  was  limited  to  having  only  63  PELs  avail¬ 
able  for  building  PICs,  about  4500  PELs  were  created 
per  speaker. 
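
The clustering step could look roughly like the plain k-means sketch below, applied separately to the spectral-model frames of each phoneme. The distance measure (squared Euclidean), the seeding from the first k frames, and the fixed iteration count are assumptions; the text states only that a k-means algorithm was used. Each resulting cluster center would supply the means of one PEL.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(frames, k, iterations=20):
    """Plain k-means over frames; returns (centers, assignments)."""
    centers = [list(f) for f in frames[:k]]   # naive deterministic seeding
    assign = [0] * len(frames)
    for _ in range(iterations):
        # Assignment step: nearest center for every frame.
        assign = [min(range(k), key=lambda c: sq_dist(f, centers[c]))
                  for f in frames]
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [f for f, a in zip(frames, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign

# Usage: five one-parameter frames falling into two obvious groups.
frames = [[0.0], [0.3], [10.0], [10.4], [0.1]]
centers, assign = kmeans(frames, k=2)
```

With k set to at most 63 per phoneme, a procedure of this kind would yield a speaker-dependent PEL inventory of the size described above.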

Once  the  new  set  of  PELs  had  been  created,  a  dy¬ 
namic  programming  algorithm  was  used  for  converting 
the spectral model to an HMM containing up to six
nodes,  with  each  node  assigned  a  PEL  and  a  duration 
distribution.  This  respelling  step  drew  on  about  4000  of 
the  4500  PELs  in  constructing  the  HMMs. 
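
One way to picture the respelling step is as a segmentation problem solved by dynamic programming. The sketch below is an assumption-laden illustration, not the procedure used in the system: it splits the frames of a spectral model into at most `max_nodes` contiguous runs, assigns each run the PEL with the smallest summed absolute distance, and charges a hypothetical `node_penalty` per node so that extra nodes are used only when they help. Duration distributions are not modeled.

```python
def respell(frames, pels, max_nodes=6, node_penalty=1.0):
    """Segment `frames` into at most `max_nodes` runs, one PEL per run.

    Returns (total_cost, [(start, end, pel_index), ...]) with `end` exclusive.
    """
    T = len(frames)

    def dist(f, p):
        return sum(abs(a - b) for a, b in zip(f, p))

    # seg[s][e] = (cost, pel) of the best single PEL for frames[s:e].
    seg = [[None] * (T + 1) for _ in range(T)]
    for s in range(T):
        totals = [0.0] * len(pels)
        for e in range(s + 1, T + 1):
            for j, p in enumerate(pels):
                totals[j] += dist(frames[e - 1], p)
            j_best = min(range(len(pels)), key=totals.__getitem__)
            seg[s][e] = (totals[j_best], j_best)

    INF = float("inf")
    # best[n][e] = minimal cost of covering frames[:e] with exactly n nodes.
    best = [[INF] * (T + 1) for _ in range(max_nodes + 1)]
    back = [[None] * (T + 1) for _ in range(max_nodes + 1)]
    best[0][0] = 0.0
    for n in range(1, max_nodes + 1):
        for e in range(1, T + 1):
            for s in range(n - 1, e):
                if best[n - 1][s] == INF:
                    continue
                cost = best[n - 1][s] + seg[s][e][0] + node_penalty
                if cost < best[n][e]:
                    best[n][e] = cost
                    back[n][e] = s
    n_best = min(range(1, max_nodes + 1), key=lambda n: best[n][T])
    # Trace back the chosen segmentation from the last frame.
    nodes, e = [], T
    for n in range(n_best, 0, -1):
        s = back[n][e]
        nodes.append((s, e, seg[s][e][1]))
        e = s
    return best[n_best][T], nodes[::-1]

# Usage: five one-parameter frames and two candidate PEL means.
frames = [[0.0], [0.1], [5.0], [5.2], [5.1]]
pels = [[0.0], [5.0]]
cost, nodes = respell(frames, pels)
```

In this toy case the cheapest spelling uses two nodes: the first PEL for the first two frames and the second PEL for the remaining three.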

A  summary  of  the  overall  training  procedure  is  outlined 
below,  with  rePELing  and  respelling  appearing  as  steps 
4  and  5: 

1.  Six  passes  of  adaptation  were  run  on  each  speaker’s 
training  data,  starting  with  the  reference  speaker’s 
models,  using  the  old  8  parameter  signal  processing. 

2.  Segmentation  of  each  speaker’s  data  was  performed, 
using  the  best  available  models  (originally,  those 
produced  in  step  1). 

3.  Spectral  models  were  built  for  each  PIC,  using  all 
32  parameters,  based  on  the  segmentation  in  step  2. 

4.  RePELing  was  done  for  each  speaker  in  order  to 
generate  a  speaker-dependent  set  of  output  distri¬ 
butions. 

5.  For  each  speaker,  respelling  was  performed  to  de¬ 
termine  the  PEL  sequences  that  would  be  used  in 
the  resulting  HMMs. 

6.  For  each  speaker,  one  additional  pass  of  adapta¬ 
tion  was  performed  in  order  to  better  estimate  the 
mean  absolute  deviations  for  each  parameter  for 
each  PEL. 

7.  Steps 2-6 could then be repeated, if desired.

Results for this method appear in section 5.

4.  TIED  MIXTURES 

Were the model described in section 3 correct, the 32
parameters in each acoustic frame corresponding to a
given PEL would be distributed as if they were generated
by 32 independent (unimodal) double exponential
distributions. However, graphical displays reveal that
the frame distributions for many PELs have multiple
modes. Furthermore, it is well known that the parameters
within a frame are correlated. In order to deal with
the multimodality of the data and to capture the dependence
among parameters, Dragon has implemented
a modeling strategy in which the output distributions
are represented in a more flexible way. This representation,
similar to other tied mixture models developed
elsewhere ([8], [12]), also provides the basis for achieving
speaker independence.

If  we  divide  the  parameters  into  groups  or  “streams”, 
with  the  property  that  parameters  in  different  streams 
can  be  assumed  to  be  independent,  then  our  new  mod¬ 
eling  strategy  represents  the  probability  of  a  frame  in 
a  given  state  as  the  product  of  probability  densities  for 
each  stream,  and  the  probability  density  for  a  stream  is 
assumed  to  be  a  mixture  distribution  over  a  fixed  set  of 
basis  distributions  specific  to  the  stream. 

More formally, we let f(x) represent the probability density
of a PEL, where x is a frame, and we assume that
f(x) is the product of s probability densities, f_i(x_i), one
for each stream:

    f(x) = \prod_{i=1}^{s} f_i(x_i).

Furthermore, we assume that each f_i can be represented
in terms of a set of basis distributions g_{ij}:

    f_i = \sum_{j=1}^{C_i} \lambda_{ij} g_{ij},

where C_i is the number of components for stream i and
the \lambda_{ij} are the mixing weights.

At the present time, we are using 32 streams; i.e., each
parameter is assumed to be statistically independent of
every other parameter in a given state. We have assumed
the 32 parameters to be independent both as a
way  of  relating  our  new  results  to  our  old  results  (which 
were  also  based  on  the  same  strong  independence  as¬ 
sumption),  and  as  a  debugging  tool.  We  chose  our  ba¬ 
sis  distributions  to  be  equally  spaced  double  exponential 
distributions  with  a  fixed  mean  absolute  deviation,  ar¬ 
ranged  so  as  to  cover  the  full  range  of  each  parameter. 
Thus,  when  a  mixture  distribution  was  estimated,  it  was 
easy to see what values in the space were relatively likely
or  unlikely.  In  the  system  reported  here,  the  set  of  basis 
components  is  the  same  for  each  stream,  which  would 
not  be  the  case  in  a  more  general  setting. 
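
The univariate tied mixture density described above can be sketched in a few lines of Python. The sketch assumes one parameter per stream, double exponential basis components with equally spaced means and a shared deviation `sigma`, and the same basis for every stream, as stated in the text; the function names, the parameter range, and the toy weights are hypothetical.

```python
import math

def laplace_pdf(x, mu, sigma):
    """Double exponential density with mean mu and mean absolute deviation sigma."""
    return math.exp(-abs(x - mu) / sigma) / (2.0 * sigma)

def make_basis(lo, hi, count):
    """Means of `count` equally spaced basis components covering [lo, hi]."""
    step = (hi - lo) / (count - 1)
    return [lo + j * step for j in range(count)]

def stream_density(x, weights, basis_means, sigma):
    """Mixture density for one stream: sum over j of weight_j * g_j(x)."""
    return sum(w * laplace_pdf(x, m, sigma) for w, m in zip(weights, basis_means))

def frame_log_density(frame, weights_per_stream, basis_means, sigma):
    """Log density of a frame: sum over streams of log mixture densities."""
    return sum(math.log(stream_density(x, w, basis_means, sigma))
               for x, w in zip(frame, weights_per_stream))

# Usage: two streams sharing a three-component basis; the first stream
# is bimodal, which a single unimodal PEL parameter could not represent.
basis = make_basis(-1.0, 1.0, 3)
weights = [[0.5, 0.0, 0.5], [0.0, 1.0, 0.0]]
logp = frame_log_density([0.0, 0.0], weights, basis, sigma=0.5)
```

Because the basis is fixed and shared, only the mixing weights differ from one PEL to another, and a glance at the weights shows which regions of the parameter range a PEL considers likely.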

The  tied  mixture  PIC  models  were  assumed  to  be  ei¬ 
ther  1-node  or  2-node  models,  with  the  number  of  nodes 
being  determined  based  on  the  proportion  of  very  short 
PIC  tokens.  At  the  present  time,  no  PEL  is  used  as  an 
output  distribution  for  more  than  one  node.  Each  tied 
mixture PIC model was built via the EM algorithm from
instances of the given PIC found in the training data for
the  given  speaker  (based  on  segmentations  obtained  us¬ 
ing  the  best  available  models).  Unfortunately,  most  of 
the  PICs  that  occur  in  the  training  data  occur  very  few 
times,  and,  not  surprisingly,  most  of  the  PICs  that  could 
in principle occur never in fact do.

Thus,  two  key  problems  that  must  be  solved  in  training 
the  recognizer  are  (1)  the  smoothing  problem  and  (2) 
the backoff problem. The maximum likelihood estimator
(MLE), together with many related asymptotically
efficient estimators, has the defect of being a rather poor
estimator  when  it  is  given  only  a  small  amount  of  data  to 
work  with:  think  of  estimating  the  probability  of  “heads” 
from  only  one  coin  flip.  Thus,  it  is  important  to  smooth 
the  MLE  when  there  is  clearly  an  insufficient  supply  of 
data.  We  have  chosen  to  implement  a  smoothing  al¬ 
gorithm  with  a  strong  Bayesian  flavor.  In  this  paper  we 
will  not  address  the  backoff  problem  in  any  detail;  at  the 
present  time,  when  we  do  not  have  a  model  for  a  PIC 
available  to  the  recognizer,  we  substitute  a  “generic”  PIC 
model,  which  has  less  specific  context  information. 

The  Bayesian  solution  to  the  coin  flip  problem  amounts 
to  representing  the  prior  information  we  may  have  about 
the  probability  of  “heads”  as  a  prior  number  of  flips,  of 
which  a  certain  number  are  taken  to  be  heads,  and  then 
combining  those  “prior”  flips  with  the  real  flips.  We  have 
taken  a  similar  approach  to  the  problem  of  estimating 
the  mixing  probabilities  in  our  tied  mixture  models.  We 
build  the  more  common  PICs  before  we  build  the  less 
common  PICs  (see  below).  At  the  time  that  we  are  ready 
to  build  a  given  PIC,  we  make  our  best  judgement  as  to 
what  the  mixing  probabilities  are  for  each  stream  of  each 
state  in  the  PIC.  This  guess  is  based  on  the  models  that 
have  already  been  built  for  related  PICs.  Not  only  do 
we  guess  the  mixing  probabilities,  but  we  also  make  a 
judgement  about  the  “relevance”  of  our  estimate,  which 
is to say, the number of frames of real data that we
judge  our  guess  to  be  worth.  We  then  use  these  prior 
estimates  to  initialize  the  EM  algorithm,  and  in  addition, 
we  combine  the  accumulated  fractional  counts  for  each 
mixture  component  with  the  prior  counts  based  on  our 
prior  guess,  in  forming  the  estimate  to  be  used  during 
the next iteration. Thus we have as our re-estimation
formula:

    \lambda_{ij} = \frac{k \lambda^{0}_{ij} + n_{ij}}{k + \sum_{j'=1}^{C_i} n_{ij'}},

where \lambda^{0}_{ij} is the a priori estimate based on the PICs
that have already been built, k is the relevance of this
estimate, and n_{ij} is the accumulated fractional count for
the jth component when estimating the distribution for
the ith parameter in a given node.
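
The smoothed re-estimation can be illustrated with the following Python sketch of a single EM iteration for the mixing weights of one stream in one node. It assumes that the update takes the usual pseudo-count form, (k * prior_j + n_j) / (k + sum of all n_j), where the prior weights act like k frames of previously seen data; the function names, the double exponential basis, and the toy numbers are hypothetical.

```python
import math

def laplace_pdf(x, mu, sigma):
    """Double exponential density with mean mu and mean absolute deviation sigma."""
    return math.exp(-abs(x - mu) / sigma) / (2.0 * sigma)

def smoothed_em_step(data, weights, prior, relevance, basis_means, sigma):
    """One EM iteration for the mixing weights of one stream of one node."""
    counts = [0.0] * len(weights)
    for x in data:
        # E-step: posterior probability of each component given x.
        joint = [w * laplace_pdf(x, m, sigma) for w, m in zip(weights, basis_means)]
        total = sum(joint)
        for j, p in enumerate(joint):
            counts[j] += p / total          # accumulated fractional count
    # M-step: combine the real counts with `relevance` frames of prior counts.
    denom = relevance + sum(counts)
    return [(relevance * p0 + n) / denom for p0, n in zip(prior, counts)]

# Usage: a single observed frame and a prior judged to be worth two frames.
prior = [0.5, 0.5]
basis_means = [0.0, 10.0]
new_weights = smoothed_em_step([0.0], prior, prior, 2.0, basis_means, 1.0)
```

With only one observation, an unsmoothed estimate would put nearly all of the weight on the first component; the prior counts pull the estimate back to roughly two thirds, in the same way that "prior" coin flips temper an estimate based on a single real flip.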

PICs  are  currently  built  in  a  prescribed  order  in  our  sys¬ 
tem:  we  build  those  for  which  there  is  the  most  data  first. 
Thus,  we  begin  by  building  the  doubly-sided  generic 
PICs,  i.e.  models  for  phonemes  averaged  over  all  left 
and  right  contexts.  Then  we  move  on  to  build  singly- 
sided  generic  PICs,  i.e.  models  for  phonemes  where  the 
context  is  specified  only  on  the  right  or  on  the  left;  we 
use  the  doubly  generic  PIC  models  to  smooth  the  mod¬ 
els  for  the  singly  generic  ones.  Finally  we  build  our  fully 
contextual PICs, but again we build the most common
ones  first,  using  the  doubly  and  singly  generic  PICs  to 
smooth  the  fully  contextual  ones.  When  building  a  rel¬ 
atively  uncommon  fully  contextual  PIC,  it  is  useful  to 
smooth  the  model  using  models  of  related  fully  contex¬ 
tual  PICs  which  share  some  of  the  context  or  have  closely 
related  contexts. 

5.  RESULTS  ON  WSJ  DATA 

This  section  contains  results  on  the  5000-word  closed- 
vocabulary  speaker-dependent  verbalized  punctuation 
version  of  the  Wall  Street  Journal  task,  using  the  devel¬ 
opment  test  data.  Table  1  lists  results  for  all  the  WSJ 
speakers,  displaying  the  word  error  rates  using  three  dif¬ 
ferent  models.  The  first  column  contains  the  results  of 
the  first  recognition  run  we  did  using  models  obtained  by 
merely  adapting  our  reference  speaker’s  original  models, 
using  our  old  8  parameter  signal  processing,  yielding  an 
overall  word  error  rate  of  35.6%.  The  second  column  con¬ 
tains  our  best  32  parameter  unimodal  models  using  the 
rePELing/respelling  training  strategy,  after  several  iter¬ 
ations  of  training,  with  an  overall  error  rate  of  16.4%.  Fi¬ 
nally  the  last  column  contains  the  results  of  our  first  ex¬ 
periment  recognizing  Wall  Street  Journal  sentences  with 
the  32  stream  tied  mixture  models  described  above,  but 
based  on  only  one  segmentation  step  (segmentation  into 
phonemes).  This  produced  a  word  error  rate  of  14.8%. 
It  is  encouraging  that  the  tied  mixture  models  yielded 
better  performance  than  did  the  unimodal  models  on  11 
out  of  the  12  speakers,  given  that  there  has  not  yet  been 
any  opportunity  for  parameter  optimization. 
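
The comparisons quoted above can be checked with a few lines of arithmetic. The sketch below copies the per-speaker word error rates from Table 1, recomputes unweighted per-speaker averages (an assumption about how the AVG row was formed), and derives relative error reductions. The speaker labels are written with digit zeros, and the helper name `relative_reduction` is hypothetical.

```python
# Word error rates (%) from Table 1, per speaker:
# (adapt only, rePEL/respell, tied mixtures).
results = {
    "001": (24.8, 8.0, 6.7),
    "002": (21.9, 9.6, 6.7),
    "00A": (64.5, 24.6, 26.8),
    "00B": (36.8, 22.3, 21.8),
    "00C": (47.1, 28.6, 28.1),
    "00D": (56.5, 23.0, 20.5),
    "00F": (43.3, 20.6, 16.9),
    "203": (27.1, 14.4, 13.9),
    "400": (27.6, 12.5, 12.4),
    "430": (30.5, 13.5, 9.6),
    "431": (29.4, 14.5, 9.8),
    "432": (18.3, 5.5, 4.7),
}

def relative_reduction(old, new):
    """Percentage by which the error rate drops going from old to new."""
    return 100.0 * (old - new) / old

# Unweighted mean of each column over the 12 speakers.
averages = [sum(row[c] for row in results.values()) / len(results)
            for c in range(3)]
# Speakers for whom tied mixtures beat the unimodal rePEL/respell models.
improved = [spk for spk, (_, uni, tied) in results.items() if tied < uni]
adapt_to_repel = relative_reduction(averages[0], averages[1])
repel_to_tied = relative_reduction(averages[1], averages[2])
```

On these figures the rePEL/respell models cut the adapt-only error rate by roughly 54%, and the tied mixture models give a further relative reduction of roughly 10%, with a single speaker doing slightly worse under tied mixtures.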

6.  CONCLUSIONS 

The  training  paradigm  outlined  above  in  the  description 
of  our  tied  mixture  modeling  has  only  recently  been  fully 
implemented  at  Dragon.  Many  aspects  of  the  training 
strategy  await  full  exploration,  but  the  early  results  we 
have  described  are  very  encouraging.  Already  we  have 
improved  our  performance  relative  to  our  old  modeling 
and  training  paradigms. 

Speaker   Adapt only   RePEL/Respell   Tied Mixtures
001          24.8           8.0             6.7
002          21.9           9.6             6.7
00A          64.5          24.6            26.8
00B          36.8          22.3            21.8
00C          47.1          28.6            28.1
00D          56.5          23.0            20.5
00F          43.3          20.6            16.9
203          27.1          14.4            13.9
400          27.6          12.5            12.4
430          30.5          13.5             9.6
431          29.4          14.5             9.8
432          18.3           5.5             4.7
AVG          35.6          16.4            14.8

Table 1: Summary of Wall Street Journal Results.
5000-word speaker-dependent closed-vocabulary
development test set word error rate (%) using verbalized
punctuation.

In the coming months we plan to focus on a number
of different aspects of training. First, we will be constructing
basis distributions for streams with more than
one  parameter  and  studying  the  effect  of  this  modeling 
on  performance.  We  anticipate  that  we  should  obtain 
improved  performance  as  we  will  then  be  modeling  the 
dependence  among  parameters  in  an  individual  frame. 
We  will  also  be  studying  a  variety  of  backoff  strategies, 
which  involve  substituting  fully  contextual  PICs  instead 
of  generic  PICs,  when  a  PIC  model  has  not  been  built. 
Another  issue  of  importance  will  be  the  nature  of  our 
Bayesian  smoothing,  which  we  hope  to  implement  in  a 
more  “data  driven”  way.  Furthermore,  we  expect  that 
the  use  of  tied  mixture  modeling  will  allow  us  to  develop 
a  high-performance  speaker-independent  recognizer,  an 
important  goal  for  the  coming  year. 